{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\nAuto-scheduling Sparse Matrix Multiplication on CPU with Custom Sketch Rule\n===========================================================================\n**Author**: `Chengfan Jia <https://github.com/jcf94/>`_\n\nThis is a tutorial on how to use the auto-scheduler to tune a sparse matrix multiplication for\nCPUs.\n\nAuto-scheduler is designed to explore the schedule with best performance for a given computation\ndeclaration automatically. While sometimes, we may have a demand to try some special ops which may\nnot been well-supported by auto-scheduler's default sketch rules and result in poor performance.\nFortunately, auto-scheduler currently allows user to provide a CustomSketch to cover these cases.\n\nWe use sparse matrix multiplication as an example in this tutorial to demonstrate how to implement\nand plug a custom sketch rule to the auto-scheduler's search policy.\n\nNote that this tutorial will not run on Windows or recent versions of macOS. To\nget it to run, you will need to wrap the body of this tutorial in a :code:`if\n__name__ == \"__main__\":` block.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import os\n\nimport numpy as np\nimport tvm\nimport tvm.testing\nfrom tvm import te, auto_scheduler, runtime, topi\nfrom tvm.auto_scheduler import _ffi_api\nfrom tvm.topi.utils import get_const_tuple\nfrom tvm.topi.sparse.utils import random_bsr_matrix"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Define the computation\n^^^^^^^^^^^^^^^^^^^^^^\nTo begin with, let us define the computation of a sparse matmul with several relu and bias add.\nThe function should return the list of input/output tensors.\nFrom these tensors, the auto-scheduler can get the whole computational graph.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "@auto_scheduler.register_workload\ndef sparse_dense(M, N, K, w_data_shape, w_indices_shape, w_indptr_shape, dtype):\n    X = te.placeholder(shape=(M, K), dtype=dtype)\n    W_data = te.placeholder(shape=w_data_shape, dtype=dtype)\n    W_indices = te.placeholder(shape=w_indices_shape, dtype=\"int32\")\n    W_indptr = te.placeholder(shape=w_indptr_shape, dtype=\"int32\")\n    B = te.placeholder(shape=(M, N), dtype=dtype)\n\n    out = topi.nn.sparse_dense(topi.nn.relu(X), W_data, W_indices, W_indptr)\n    out = te.compute((M, N), lambda i, j: out[i, j] + B[i, j], name=\"BiasAdd\")\n    out = topi.nn.relu(out)\n\n    return [X, W_data, W_indices, W_indptr, B, out]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Special step for sparse workload\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nDuring schedule tuning, auto-scheduler will use random inputs to measure the performance of a\ngenerated schedule. While we cannot directly use a random array as the input of a sparse op, for\nthe \"indices\" and \"indptr\" array are meaningful for the computation.\n\nTo solve this problem, we register these as special buffers, and load them when process program\nmeasuring.\nSee the `tvm.auto_scheduler.measure.py` for more details.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Define the basic shapes of this sparse computation\nM = 128\nK = 256\nN = 512\nBS_R = 16\nBS_C = 1\ndensity = 0.6\n\n# Generate the test data with numpy\nX_np = np.random.randn(M, K).astype(\"float32\")\nX_np = np.maximum(np.zeros((M, K), dtype=\"float32\"), X_np)  # Relu\nW_sp_np = random_bsr_matrix(N, K, BS_R, BS_C, density=density, dtype=\"float32\")\nW_np = W_sp_np.todense()\nY_np = X_np @ W_np.T  # Process the matrix multiplication\nB_np = np.random.randn(M, N).astype(\"float32\")\nY_np = Y_np + B_np  # Bias add\nY_np = np.maximum(np.zeros((M, N), dtype=\"float32\"), Y_np)  # Relu"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Create the search task\n^^^^^^^^^^^^^^^^^^^^^^\nWe then create a search task with M=N=K=512 and dtype=\"float32\"\nIf your machine supports avx instructions, you can\n\n  - replace \"llvm\" below with \"llvm -mcpu=core-avx2\" to enable AVX2\n  - replace \"llvm\" below with \"llvm -mcpu=skylake-avx512\" to enable AVX-512\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "target = tvm.target.Target(\"llvm\")\n\n# Register the sparse data to task inputs\nprefix = \"sparse_dense_bsr_%d_%d_%d_%d_%d_%d_\" % (\n    N,\n    K,\n    BS_R,\n    BS_C,\n    W_sp_np.indices.shape[0],\n    W_sp_np.indptr.shape[0],\n)\ntask = tvm.auto_scheduler.SearchTask(\n    func=sparse_dense,\n    args=(M, N, K, W_sp_np.data.shape, W_sp_np.indices.shape, W_sp_np.indptr.shape, \"float32\"),\n    target=target,\n    task_inputs={\n        prefix + \"W_data\": runtime.ndarray.array(W_sp_np.data),\n        prefix + \"W_indices\": runtime.ndarray.array(W_sp_np.indices),\n        prefix + \"W_indptr\": runtime.ndarray.array(W_sp_np.indptr),\n    },\n    task_inputs_save_to_file=True,\n)\n\n# Inspect the computational graph\nprint(\"Computational DAG:\")\nprint(task.compute_dag)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Write the custom sketch for sparse dense op\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nBefore tuning, we will need to define the CustomSketchRule for the sparse dense op.\n\nCustomSketchRule consists of two parts: the condition function and the apply function.\n\n  - condition function: describe when to apply this sketch rule. For example, we can only apply\n    the rule to the sparse ops by matching their name and tag.\n  - apply function: describe how to generate the initial sketch. You can implement it using\n    auto-scheduler provided loop state APIs.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def meet_condition_func(search_policy, state, stage_id):\n    state = auto_scheduler.loop_state.State(state, search_policy.search_task.compute_dag)\n    if state.stages[stage_id].op.tag in [\n        \"sparse_dense_sp_rhs_bsrmm\",\n        \"sparse_dense_sp_rhs_bsrmm_block\",\n    ]:\n        return auto_scheduler.PreloadCustomSketchRule.APPLY_AND_SKIP_REST\n    else:\n        return auto_scheduler.PreloadCustomSketchRule.PASS\n\n\ndef apply_func(search_policy, state, stage_id):\n    ret = []\n    s0 = auto_scheduler.loop_state.State(state, search_policy.search_task.compute_dag)\n    if s0.stages[stage_id].op.tag == \"sparse_dense_sp_rhs_bsrmm_block\":\n        return [s0.state_object, stage_id - 1]\n\n    sparse_dense = s0.stages[stage_id].op\n    sparse_dense_block = s0.stages[stage_id - 1].op\n    assert sparse_dense.tag == \"sparse_dense_sp_rhs_bsrmm\"\n    assert sparse_dense_block.tag == \"sparse_dense_sp_rhs_bsrmm_block\"\n\n    # Set the default consumer of compute block\n    consumer = sparse_dense\n\n    # If sparse dense has a single elementwise consumer\n    # We can compute inline the sparse_dense output stage\n    consumers = _ffi_api.SearchPolicyUtilsGetConsumers(\n        search_policy.search_task, s0.state_object, stage_id\n    )\n    if len(consumers) == 1:\n        consumer_id = int(consumers.items()[0][0])\n        if _ffi_api.SearchPolicyUtilsIsElementwiseMatch(\n            search_policy.search_task, s0.state_object, stage_id, consumer_id\n        ):\n            consumer = s0.stages[consumer_id].op\n            s0.compute_inline(sparse_dense)\n\n    i, nb_j, j, row_offset, c = s0[sparse_dense_block].iters\n    m, n = s0[consumer].iters\n    i0, i1, i2 = s0.split(sparse_dense_block, i, [None, None])\n    m0, m1 = s0.follow_split(consumer, m, len(s0.transform_steps) - 1, 1)\n    j0, j1 = s0.split(sparse_dense_block, nb_j, [None])\n    n0, n1 = s0.follow_split(consumer, n, len(s0.transform_steps) - 1, 1)\n    s0.reorder(sparse_dense_block, [i0, j0, i1, j1, row_offset, i2, j, c])\n    s0.reorder(consumer, [m0, n0, m1, n1])\n    s0.compute_at(sparse_dense_block, consumer, n0)\n\n    ret.append([s0.state_object, stage_id - 2])\n\n    return ret"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we set parameters for the auto-scheduler with the custom sketch plugged in.\n\n* :code:`num_measure_trials` is the number of measurement trials we can use during the search.\n  We only make 10 trials in this tutorial for a fast demonstration. In practice, 1000 is a\n  good value for the search to converge. You can do more trials according to your time budget.\n* In addition, we use :code:`RecordToFile` to dump measurement records into a file\n  `sparse_dense.json`.\n  The measurement records can be used to query the history best, resume the search,\n  and do more analyses later.\n* see :any:`auto_scheduler.TuningOptions` for more parameters\n* Here, we need to create a :code:`auto_scheduler.SketchPolicy` object, and add the custom sketch\n  rule as a `init_search_callbacks`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "log_file = \"sparse_dense.json\"\ntune_option = auto_scheduler.TuningOptions(\n    num_measure_trials=10,\n    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],\n    verbose=2,\n)\n\nsearch_policy = auto_scheduler.SketchPolicy(\n    task,\n    program_cost_model=auto_scheduler.XGBModel(),\n    init_search_callbacks=[\n        auto_scheduler.PreloadCustomSketchRule(meet_condition_func, apply_func, \"SparseDense\")\n    ],\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Run the search\n^^^^^^^^^^^^^^\nNow we get all inputs ready.\nWe can kick off the search and let the auto-scheduler do its magic.\nAfter some measurement trials, we can load the best schedule from the log\nfile and apply it.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Run auto-tuning (search)\n# Notice: We do not run the tuning in our webpage server since it takes too long.\n# Uncomment the following line to run it by yourself.\ntask.tune(tune_option, search_policy)\n\n# Apply the best schedule\nsch, args = task.apply_best(log_file)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can lower the schedule to see the IR after auto-scheduling.\nThe auto-scheduler correctly performs optimizations including multi-level tiling,\nlayout transformation, parallelization, vectorization, unrolling, and operator fusion.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\"Lowered TIR:\")\nprint(tvm.lower(sch, args, simple_mode=True))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Check correctness and evaluate performance\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nWe build the binary and check its correctness and performance.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "func = tvm.build(sch, args, target)\n\ndev = tvm.cpu()\n\nX_tvm = tvm.nd.array(X_np, device=dev)\nW_data_tvm = tvm.nd.array(W_sp_np.data, device=dev)\nW_indices_tvm = tvm.nd.array(W_sp_np.indices, device=dev)\nW_indptr_tvm = tvm.nd.array(W_sp_np.indptr, device=dev)\nB_tvm = tvm.nd.array(B_np, device=dev)\nY_tvm = tvm.nd.empty(Y_np.shape, device=dev)\n\nfunc(X_tvm, W_data_tvm, W_indices_tvm, W_indptr_tvm, B_tvm, Y_tvm)\n\n# Check results\ntvm.testing.assert_allclose(Y_np, Y_tvm.numpy(), atol=1e-4, rtol=1e-4)\n\n# Evaluate execution time.\nevaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)\nprint(\n    \"Execution time of this operator: %.3f ms\"\n    % (\n        np.median(evaluator(X_tvm, W_data_tvm, W_indices_tvm, W_indptr_tvm, B_tvm, Y_tvm).results)\n        * 1000\n    )\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>Tuning result example\n\n  .. code-block:: c\n\n   ----------------------------------------------------------------------\n   Lowered TIR:\n   primfn(placeholder_5: handle, placeholder_6: handle, placeholder_7: handle, placeholder_8: handle, placeholder_9: handle, compute_1: handle) -> ()\n     attr = {\"global_symbol\": \"main\", \"tir.noalias\": True}\n     buffers = {placeholder_2: Buffer(placeholder_10: Pointer(float32), float32, [9831, 16, 1], []),\n                placeholder_4: Buffer(placeholder_11: Pointer(int32), int32, [33], []),\n                placeholder_3: Buffer(placeholder_12: Pointer(float32), float32, [512, 512], []),\n                compute: Buffer(compute_2: Pointer(float32), float32, [512, 512], []),\n                placeholder_1: Buffer(placeholder_13: Pointer(float32), float32, [512, 512], []),\n                placeholder: Buffer(placeholder_14: Pointer(int32), int32, [9831], [])}\n     buffer_map = {placeholder_7: placeholder, placeholder_9: placeholder_1, placeholder_6: placeholder_2, compute_1: compute, placeholder_5: placeholder_3, placeholder_8: placeholder_4} {\n     for (i0.outer.i1.outer.fused: int32, 0, 1024) \"parallel\" {\n       attr [compute_3: Pointer(float32)] \"storage_scope\" = \"global\";\n       allocate(compute_3, float32, [256]) {\n         for (nb_j.inner: int32, 0, 2) {\n           for (i.inner.init: int32, 0, 8) {\n             for (j.init: int32, 0, 16) {\n               compute_3[(((i.inner.init*32) + (nb_j.inner*16)) + j.init)] = 0f32\n             }\n           }\n           for (elem_idx: int32, 0, ((int32*)placeholder_11[(((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner) + 1)] - (int32*)placeholder_11[((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner)])) {\n             for (i.inner: int32, 0, 8) {\n               for (j: int32, 0, 16) {\n                 compute_3[(((i.inner*32) + (nb_j.inner*16)) + j)] = ((float32*)compute_3[(((i.inner*32) + (nb_j.inner*16)) + j)] + ((float32*)placeholder_10[((((int32*)placeholder_11[((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner)]*16) + (elem_idx*16)) + j)]*max((float32*)placeholder_12[(((floordiv(i0.outer.i1.outer.fused, 16)*4096) + (i.inner*512)) + (int32*)placeholder_14[((int32*)placeholder_11[((floormod(i0.outer.i1.outer.fused, 16)*2) + nb_j.inner)] + elem_idx)])], 0f32)))\n               }\n             }\n           }\n         }\n         for (i0.inner: int32, 0, 8) {\n           compute_2[ramp((((floordiv(i0.outer.i1.outer.fused, 16)*4096) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 16)*32)), 1, 32)] = max(((float32x32*)compute_3[ramp((i0.inner*32), 1, 32)] + (float32x32*)placeholder_13[ramp((((floordiv(i0.outer.i1.outer.fused, 16)*4096) + (i0.inner*512)) + (floormod(i0.outer.i1.outer.fused, 16)*32)), 1, 32)]), broadcast(0f32, 32))\n         }\n       }\n     }\n   }</p></div>\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}